refactor: Support batch export models as views #23052
Conversation
```python
@pytest_asyncio.fixture(scope="module", autouse=True)
async def create_batch_export_views(clickhouse_client, django_db_setup):
```
Apparently ClickHouse migrations don't really run in tests.
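Since the migrations don't run during tests, a fixture presumably has to apply the view-creation DDL itself. A minimal sketch of that idea, with a hypothetical helper and stub client (these names are assumptions, not PostHog's actual API):

```python
import asyncio

# Hypothetical helper mirroring what a test fixture might do when ClickHouse
# migrations don't run: execute each view-creation statement directly.
async def create_views(client, statements):
    for statement in statements:
        await client.execute_query(statement)

# Minimal stub for demonstration; a real test would use the ClickHouse
# client fixture instead. The stub only records what was executed.
class StubClient:
    def __init__(self):
        self.ran = []

    async def execute_query(self, statement):
        self.ran.append(statement)
```

In a real test module this would run once per module via an autouse `pytest_asyncio` fixture, as in the snippet above.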
```diff
@@ -1,6 +1,5 @@
 import datetime as dt
 import json
-import uuid
```
Every change in this file is just clean-up.
```diff
@@ -357,6 +359,23 @@ def stream_query_as_arrow(
     with pa.ipc.open_stream(pa.PythonFile(response.raw)) as reader:
         yield from reader
+
+async def astream_query_as_arrow(
```
Not in use yet, will use for person model exports.
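A rough sketch of the shape such an async streaming method takes from the caller's perspective. This is an illustration with a stand-in data source, not the PR's implementation (which would read Arrow IPC record batches from an HTTP response body):

```python
import asyncio

# Illustrative stand-in for astream_query_as_arrow: an async generator that
# yields record batches one at a time so callers can `async for` over them.
# A real implementation would open a pyarrow IPC stream over the response
# instead of iterating an in-memory list.
async def astream_batches(batches):
    for batch in batches:
        # Yield control to the event loop between batches, as real network
        # reads would.
        await asyncio.sleep(0)
        yield batch

async def consume(batches):
    collected = []
    async for batch in astream_batches(batches):
        collected.append(batch)
    return collected
```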
```python
# without battle testing it first.
# There are already changes going out to the queries themselves that will impact events in a
# positive way. So, we can come back later and drop this block.
for record_batch in iter_records(client, team_id=team_id, is_backfill=is_backfill, **parameters):
```
See TODO above: For now, keeping old functionality intact for safety. Will release person model first, test, and then switch events.
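The staged rollout described here, new view-based path for persons while events keep the old path, might look roughly like this (function and parameter names are hypothetical, not the PR's actual code):

```python
# Hypothetical sketch of a staged rollout: route the new persons model
# through the view-based path while events keep the old iter_records path.
def iter_model_records(model, old_path, new_path):
    if model == "persons":
        # New code path: query the parametrized view.
        yield from new_path()
    else:
        # Old code path, kept intact for safety until the view-based
        # queries are battle tested.
        yield from old_path()
```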
```diff
@@ -77,78 +77,20 @@ def get_timestamp_field(is_backfill: bool) -> str:
     return timestamp_field
 
-async def get_rows_count(
```
We don't count rows anymore, so this function is unused. Cleaning up.
posthog/batch_exports/sql.py
```sql
PREWHERE
    COALESCE(events.inserted_at, events._timestamp) >= {interval_start:DateTime64}
    AND COALESCE(events.inserted_at, events._timestamp) < {interval_end:DateTime64}
```
😎
At this point, `inserted_at` should always be set for all batch exports. Only historical exports require `_timestamp`, but backfills have already been switched over to query based on `timestamp`, so they also do not need to check for `inserted_at`/`_timestamp`. Removing the coalesce and using only `inserted_at` halves the amount of data ClickHouse has to fetch.
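The simplification sketched as before/after query fragments (illustrative shapes, not the exact SQL in `posthog/batch_exports/sql.py`):

```python
# Before: COALESCE forces ClickHouse to read both inserted_at and _timestamp.
OLD_PREWHERE = """
PREWHERE
    COALESCE(events.inserted_at, events._timestamp) >= {interval_start:DateTime64}
    AND COALESCE(events.inserted_at, events._timestamp) < {interval_end:DateTime64}
"""

# After: inserted_at is always set now, so only one column needs fetching,
# roughly halving the data ClickHouse reads for this filter.
NEW_PREWHERE = """
PREWHERE
    events.inserted_at >= {interval_start:DateTime64}
    AND events.inserted_at < {interval_end:DateTime64}
"""
```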
This is great. I'm excited to use Views here...easy to iterate and low risk.
```sql
)
FORMAT ArrowStream
```
```python
yield record_batch
return

async for record_batch in client.astream_query_as_arrow(view, query_parameters=parameters):
```
nice.
* refactor: Update metrics to fetch counts at request time
* fix: Move import to method
* fix: Add function
* feat: Custom schemas for batch exports
* feat: Frontend support for model field
* fix: Clean-up
* fix: Add missing migration
* fix: Make new field nullable
* Update UI snapshots for `chromium` (1) (×6)
* fix: Bump migration number (×2)
* refactor: Update metrics to fetch counts at request time
* fix: Actually use include and exclude events
* refactor: Switch to counting runs
* refactor: Support batch export models as views
* fix: Merge conflict
* fix: Quality check fixes
* refactor: Update metrics to fetch counts at request time
* fix: Move import to method
* fix: Add function
* fix: Typing fixes
* feat: Custom schemas for batch exports
* feat: Frontend support for model field
* fix: Clean-up
* fix: Add missing migration
* fix: Make new field nullable
* Update UI snapshots for `chromium` (1) (×6)
* fix: Bump migration number
* fix: Clean-up unused code
* chore: Clean-up unused function and tests
* fix: Clean-up unused function
* fix: HTTP Batch export default fields
* fix: Remove test case on new column not present in base table
* chore: Clean-up unused functions and queries
* fix: Only run extra clickhouse queries in batch exports tests
* refactor: Remove coalesce and use only inserted_at in queries

  At this point, inserted_at should always be set for all batch exports. Only historical exports require _timestamp, but backfills have already been switched over to query based on timestamp, so they also do not need to check for inserted_at/_timestamp. Removing the coalesce and using only inserted_at reduces the size of the data CH has to fetch by half.

* fix: Remove deprecated test
* fix: Add person_id to person model and enforce ordering
* refactor: Also add version column

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Problem
This PR lays the groundwork to support person batch exports by introducing a set of parametrized views that will serve as the basis of batch exports moving forward. This new set of views includes a view for the persons model. Using parametrized views over the old approach has the benefit that the query logic is easier to parse when it's laid out in plain SQL (ideally these would be SQL files, but our migration system works with Python files...).
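As an illustration of the approach (not this PR's exact SQL), a ClickHouse parametrized view embeds typed query parameters like `{team_id:Int64}` directly in the view definition, which are then bound when the view is queried:

```python
# Illustrative ClickHouse parametrized-view DDL; the view name and column
# list are assumptions, not the PR's actual definitions.
CREATE_EVENTS_EXPORT_VIEW = """
CREATE OR REPLACE VIEW events_batch_export AS
SELECT
    team_id,
    event,
    timestamp,
    inserted_at
FROM events
PREWHERE inserted_at >= {interval_start:DateTime64}
    AND inserted_at < {interval_end:DateTime64}
WHERE team_id = {team_id:Int64}
"""
```

Such a view is queried with its parameters passed as arguments, e.g. `SELECT * FROM events_batch_export(team_id=..., interval_start=..., interval_end=...)`.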
Changes
- Introduce new parametrized views for person and events batch exports.
- Use `GROUP BY` instead of `DISTINCT`. This should perform better and use less memory (in particular with `optimize_aggregation_in_order` enabled).
- Use a `PREWHERE` clause to filter first on `inserted_at`. Once again, this should result in better performance.
- Lay out the `timestamp` and `inserted_at` filters in plain SQL instead of composing them via Python formatting. They should behave the same as before.
- Introduce new methods to iterate over records asynchronously, yielding `pyarrow.RecordBatch`.
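A small illustration of the `DISTINCT`-to-`GROUP BY` change (table and column names are hypothetical). For plain deduplication on the selected keys the two forms return the same rows, but `GROUP BY` lets ClickHouse aggregate in sorting-key order when `optimize_aggregation_in_order` is enabled, instead of materializing a full deduplication set in memory:

```python
# Before: DISTINCT deduplicates the selected rows.
OLD_QUERY = "SELECT DISTINCT person_id FROM persons"

# After: GROUP BY on the same keys yields the same distinct set, and can
# stream in sorting-key order with optimize_aggregation_in_order enabled.
NEW_QUERY = """
SELECT person_id
FROM persons
GROUP BY person_id
"""
```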
Does this work well for both Cloud and self-hosted?
How did you test this code?